In recent years, convolutional neural network (CNN) based crowd counting methods have achieved promising results. However, the scale variation problem is still a huge challenge for accurate count estimation. In this paper, we propose a multi-scale feature aggregation network (MSFANet) that can alleviate this problem to some extent. Specifically, our approach consists of two feature aggregation modules: short aggregation (ShortAgg) and skip aggregation (SkipAgg). The ShortAgg module aggregates the features of adjacent convolution blocks; its purpose is to fuse features with different receptive fields gradually from the bottom of the network upward. The SkipAgg module directly propagates features with small receptive fields to features with much larger receptive fields; its purpose is to promote the fusion of features with small and large receptive fields. In particular, the SkipAgg module introduces local self-attention features from Swin Transformer blocks to incorporate rich spatial information. Furthermore, we propose a local-and-global based counting loss that takes the non-uniform crowd distribution into account. Extensive experiments on four challenging datasets (the ShanghaiTech dataset, the UCF_CC_50 dataset, the UCF-QNRF dataset, and the WorldExpo'10 dataset) show that the proposed easy-to-implement MSFANet achieves promising results compared with previous state-of-the-art approaches.
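To make the aggregation idea concrete, here is a minimal PyTorch-style sketch of a ShortAgg-like unit; the 1x1 projection, the pooling used to match spatial sizes, and the 3x3 fusion convolution are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ShortAgg(nn.Module):
    """Sketch: fuse the features of two adjacent convolution blocks."""
    def __init__(self, c_prev, c_curr):
        super().__init__()
        # project the previous block's features to the current channel width
        # and downsample so the spatial sizes match (assumed stride of 2)
        self.proj = nn.Sequential(
            nn.Conv2d(c_prev, c_curr, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )
        self.fuse = nn.Conv2d(c_curr, c_curr, kernel_size=3, padding=1)

    def forward(self, f_prev, f_curr):
        # f_prev: (B, c_prev, 2H, 2W), f_curr: (B, c_curr, H, W)
        return self.fuse(self.proj(f_prev) + f_curr)

# usage: agg = ShortAgg(128, 256); out = agg(block3_feats, block4_feats)
```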
Existing deep-learning-based full-reference IQA (FR-IQA) models usually predict image quality in a deterministic way by explicitly comparing features, i.e., measuring how far the features of a severely distorted image deviate spatially from those of the reference image. In this paper, we look at this problem from a different viewpoint and propose to model quality degradation in the perceptual space from the perspective of statistical distributions. Accordingly, quality is measured by the Wasserstein distance in the deep feature domain. More specifically, the 1D Wasserstein distance is measured at each stage of a pre-trained VGG network, based upon which the final quality score is computed. The Deep Wasserstein Distance (DeepWSD), performed on the features of neural networks, better accounts for the quality contamination caused by various types of distortion and presents advanced quality prediction capability. Extensive experiments and theoretical analysis show the superiority of the proposed DeepWSD in terms of both quality prediction and optimization.
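As a minimal sketch of that measurement, assuming equal-size feature tensors per stage and a plain average across stages (the paper's exact pooling and weighting may differ):

```python
import torch

def wasserstein_1d(x, y):
    # 1D Wasserstein-1 distance between two equal-size empirical
    # distributions: the mean absolute difference of the sorted samples
    xs, _ = torch.sort(x.flatten())
    ys, _ = torch.sort(y.flatten())
    return (xs - ys).abs().mean()

def deep_wsd(feats_distorted, feats_reference):
    # hypothetical aggregation: average the 1D Wasserstein distances
    # measured on the features of each stage of a pre-trained VGG network
    stage_dists = [wasserstein_1d(fd, fr)
                   for fd, fr in zip(feats_distorted, feats_reference)]
    return torch.stack(stage_dists).mean()
```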
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
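A minimal sketch of the residual convolution over attention maps, assuming logits shaped (batch, heads, seq_len, seq_len), a 3x3 convolution with heads as channels, and a scalar mixing weight (all assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class EvolvingAttention(nn.Module):
    """Sketch: evolve the previous layer's attention maps with a residual
    convolution before combining them with the current layer's logits."""
    def __init__(self, num_heads, alpha=0.5):
        super().__init__()
        # heads act as channels over the (token x token) plane
        self.conv = nn.Conv2d(num_heads, num_heads, kernel_size=3, padding=1)
        self.alpha = alpha  # assumed mixing weight

    def forward(self, attn_logits, prev_maps=None):
        # attn_logits, prev_maps: (batch, heads, seq_len, seq_len)
        if prev_maps is not None:
            attn_logits = attn_logits + self.alpha * self.conv(prev_maps)
        return torch.softmax(attn_logits, dim=-1)
```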
This paper is about an extraordinary phenomenon. Suppose we don't use any low-light images as training data, can we enhance a low-light image by deep learning? Obviously, current methods cannot do this, since deep neural networks need their scads of parameters trained on copious amounts of data, especially task-related data. In this paper, we show that in the context of fundamental deep learning, it is possible to enhance a low-light image without any task-related training data. Technically, we propose a new, magical, effective and efficient method, termed \underline{Noi}se \underline{SE}lf-\underline{R}egression (NoiSER), which learns a gray-world mapping from Gaussian distribution for low-light image enhancement (LLIE). Specifically, a self-regression model is built as a carrier to learn a gray-world mapping during training, which is performed by simply iteratively feeding random noise. During inference, a low-light image is directly fed into the learned mapping to yield a normal-light one. Extensive experiments show that our NoiSER is highly competitive with current task-related-data-based LLIE models in terms of quantitative and visual results, while outperforming them in terms of the number of parameters, training time and inference speed. With only about 1K parameters, NoiSER needs about 1 minute for training and 1.2 ms for inference at 600$\times$400 resolution on an RTX 2080 Ti. Besides, NoiSER has an inborn automated exposure suppression capability and can automatically adjust images that are too bright or too dark, without additional manipulation.
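A minimal sketch of the training idea under stated assumptions: a tiny convolutional self-regression network is trained only on Gaussian noise, regressing the noise onto itself (the exact architecture, loss, and schedule are assumptions):

```python
import torch
import torch.nn as nn

# assumed tiny network; NoiSER reportedly has only ~1K parameters
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    noise = torch.randn(4, 3, 64, 64)         # random Gaussian input, no images
    loss = (net(noise) - noise).abs().mean()  # self-regression objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# inference: feed a low-light image through the learned mapping
# enhanced = net(low_light_image)
```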
Mainstream image caption models are usually two-stage captioners, i.e., calculating object features with a pre-trained detector and feeding them into a language model to generate text descriptions. However, such an operation creates a task-based information gap that decreases performance, since the object features learned for the detection task are a suboptimal representation and cannot provide all the information necessary for subsequent text generation. Besides, object features are usually taken from the detector's last layer, which loses the local details of input images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to calculate multi-level features, and then feed them into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling of the encoder for captioning, we propose a new dual-dimensional refining module to non-locally model the interaction of the embedded features. Finally, OSIC can obtain rich and useful information to improve the image caption task. Extensive comparisons on the benchmark MS-COCO dataset verify the superior performance of our method.
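One hypothetical way to fuse such multi-level backbone features into a single token sequence is sketched below; the channel widths are typical Swin-T stage widths, and the simple concatenation is an assumption, not the paper's module:

```python
import torch
import torch.nn as nn

class MultiSightEmbedding(nn.Module):
    """Sketch: project multi-level features to a shared width and
    concatenate them into one token sequence for the captioner."""
    def __init__(self, channels=(96, 192, 384, 768), d_model=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(c, d_model) for c in channels)

    def forward(self, feats):
        # feats: list of (B, H_i*W_i, C_i) token maps from a Swin backbone
        tokens = [p(f) for p, f in zip(self.proj, feats)]
        # early levels carry local texture, late levels global structure
        return torch.cat(tokens, dim=1)
```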
The analysis of human interaction is an important research topic in human motion analysis. It has been studied using first-person vision (FPV) or third-person vision (TPV). However, joint learning of the two visions has so far attracted little attention. One of the reasons is the lack of suitable datasets covering both FPV and TPV. In addition, existing benchmark datasets of either FPV or TPV have several limitations, including the limited number of samples, participants, interaction categories, and modalities. In this work, we contribute a large-scale human interaction dataset, namely the FT-HID dataset. FT-HID contains pair-aligned samples of first-person and third-person visions. The dataset was collected from 109 distinct subjects and has 90K samples of three modalities. The dataset has been validated using several existing action recognition methods. In addition, we introduce a novel multi-view interaction mechanism for skeleton sequences, and a joint-learning multi-stream framework for first-person and third-person visions. Both methods yield promising results on the FT-HID dataset. It is expected that the introduction of this vision-aligned large-scale dataset will promote the development of both FPV and TPV, and of their joint-learning techniques for human action analysis. The dataset and code are available at \href{https://github.com/endlichere/ft-hid}{here}.
In complex scenarios, especially at urban traffic intersections, a deep understanding of entity relationships and motion behaviors is very important for achieving high-quality planning. We propose D2-TPred, a trajectory prediction approach with respect to traffic lights, which uses a spatial dynamic interaction graph (SDG) and a behavior dependency graph (BDG) to handle the problem of discontinuous dependency in the spatial-temporal space. Specifically, the SDG is used to capture spatial interactions by reconstructing sub-graphs for different agents with dynamic and changeable characteristics in each frame. The BDG is used to infer motion tendency by modeling the implicit dependency of the current state on prior behaviors, especially the discontinuous motions corresponding to acceleration, deceleration, or turning direction. Moreover, we present a new dataset for vehicle trajectory prediction under traffic lights, called VTP-TL. Our experimental results show that our model achieves improvements of 20.45% and 20.78% in terms of ADE and FDE, respectively, compared with other trajectory prediction algorithms. The dataset and code are available at: https://github.com/vtp-tl/d2-tpred.
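As an illustration of the per-frame dynamic graph (a sketch only; the distance-based connection rule and radius are assumptions, not the paper's construction):

```python
import torch

def dynamic_interaction_graph(positions, radius=10.0):
    # positions: (num_agents, 2) coordinates at a single frame;
    # agents closer than `radius` are connected, so the adjacency
    # changes from frame to frame as agents move
    dist = torch.cdist(positions, positions)
    adj = (dist < radius).float()
    adj.fill_diagonal_(0.0)  # no self-loops
    return adj
```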
Blind image deblurring (BID) remains a challenging and significant task. Benefiting from the strong fitting ability of deep learning, paired data-driven supervised BID methods have achieved great progress. However, paired data are usually synthesized by hand, and realistic blur is more complex than synthetic blur, which makes supervised methods inept at modeling realistic blur and hinders their real-world applications. As such, unsupervised deep BID methods without paired data offer certain advantages, but current methods still suffer from some drawbacks, such as bulky model size, long inference time, and strict requirements on image resolution and domain. In this paper, we propose a lightweight and real-time unsupervised BID baseline, termed frequency-domain contrastive loss constrained lightweight CycleGAN (FCL-GAN for short), with attractive properties: no image-domain restriction, no image-resolution restriction, 25x lighter than SOTA, and 5x faster than SOTA. To guarantee the lightweight property and performance superiority, two new collaboration units, called the lightweight domain conversion unit (LDCU) and the parameter-free frequency-domain contrastive unit (PFCU), are designed. The LDCU mainly implements inter-domain conversion in a lightweight manner. The PFCU further explores the similarity measure, external difference, and internal connection between blurred-domain and sharp-domain images in the frequency domain, without involving extra parameters. Extensive experiments on several image datasets demonstrate the effectiveness of our FCL-GAN in terms of performance, model size, and inference time.
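A parameter-free frequency-domain contrastive term could look like the following sketch; the amplitude-spectrum comparison and the ratio form are assumptions about the PFCU, not its exact formulation:

```python
import torch
import torch.nn.functional as F

def frequency_contrastive_loss(restored, sharp, blurry, eps=1e-8):
    # pull the restored image's amplitude spectrum towards the sharp
    # (positive) image and away from the blurry (negative) input,
    # without introducing any learnable parameters
    amp = lambda img: torch.fft.rfft2(img).abs()
    d_pos = F.l1_loss(amp(restored), amp(sharp))
    d_neg = F.l1_loss(amp(restored), amp(blurry))
    return d_pos / (d_neg + eps)
```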
The intelligent traffic light control system (ITLCS) is a typical multi-agent system (MAS) comprising multiple roads and traffic lights. Constructing a MAS model for the ITLCS is the basis for alleviating traffic congestion. Existing approaches to MAS are largely based on multi-agent deep reinforcement learning (MADRL). Although the deep neural networks (DNNs) of MADRL are effective, the training time is long and the parameters are difficult to trace. Recently, the broad learning system (BLS) has provided an alternative way of learning with a flat network instead of a deep structure. Moreover, broad reinforcement learning (BRL) extends BLS to single-agent deep reinforcement learning (SADRL) problems with promising results. However, BRL does not focus on the complex structures and interactions of agents. Motivated by the features of MADRL and the issues of BRL, we propose a multi-agent broad reinforcement learning (MABRL) framework to explore the function of BLS in MAS. Firstly, unlike most MADRL approaches, which use a series of deep neural network structures, we model each agent with a broad network. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: when to interact, which information the agents need to consider, and what information to transmit. Finally, we conduct experiments based on intelligent traffic light control scenarios. We compare the MABRL approach with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.
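For context, here is a minimal sketch of the flat broad-network head each agent would use; random feature and enhancement nodes with a ridge-regression readout are standard BLS components, but the node counts and regularization here are assumptions:

```python
import numpy as np

def bls_fit(X, Y, n_feature=64, n_enhance=128, reg=1e-3, seed=0):
    # broad learning system: random feature nodes and enhancement nodes
    # in one flat layer, with a closed-form ridge-regression readout
    rng = np.random.default_rng(seed)
    Wf = rng.normal(size=(X.shape[1], n_feature))
    Z = np.tanh(X @ Wf)                       # feature nodes
    We = rng.normal(size=(n_feature, n_enhance))
    H = np.tanh(Z @ We)                       # enhancement nodes
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ Y)
    return Wf, We, W

def bls_predict(X, Wf, We, W):
    Z = np.tanh(X @ Wf)
    A = np.hstack([Z, np.tanh(Z @ We)])
    return A @ W
```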
Graph convolutional network (GCN) based methods have achieved advanced performance on skeleton-based action recognition tasks. However, the skeleton graph cannot fully represent the motion information contained in skeleton data. In addition, the topology of the skeleton graph in GCN-based methods is manually set according to natural connections, and it is fixed for all samples, which cannot adapt well to different situations. In this work, we propose a novel dynamic hypergraph convolutional network (DHGCN) for skeleton-based action recognition. DHGCN uses a hypergraph to represent the skeleton structure so as to effectively exploit the motion information contained in human joints. Each joint in the skeleton hypergraph is dynamically assigned a weight according to its movement, and the hypergraph topology in our model can be dynamically adjusted to different samples according to the relationships between joints. Experimental results demonstrate that our model achieves competitive performance on three datasets: Kinetics-Skeleton 400, NTU RGB+D 60, and NTU RGB+D 120.
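A sketch of one hypergraph convolution step over a joint-to-hyperedge incidence matrix H follows; the dynamic, movement-based construction of H is omitted, and the degree normalization is a common choice rather than necessarily the paper's:

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """Sketch: x' = Dv^-1 H De^-1 H^T x W over a skeleton hypergraph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x, H):
        # x: (num_joints, in_dim); H: (num_joints, num_hyperedges), 0/1
        Dv = H.sum(dim=1).clamp(min=1)   # vertex degrees
        De = H.sum(dim=0).clamp(min=1)   # hyperedge degrees
        msg = H @ ((H.t() @ x) / De.unsqueeze(1))  # gather, then scatter
        return self.theta(msg / Dv.unsqueeze(1))
```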